Thermodynamic stability of folded proteins against mutations 
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By balancing the average energy gap with its typical change due to mutations for protein-like 
heteropolymers with M residues, we show that native states are unstable to mutations on a scale 



(\/c 



where A is the dispersion in the interaction free energies and ct^ their typi- 



cal change. Theoretical bounds and numerical estimates (based on complete enumeration on four 
lattices) of the instability exponent C,b are given. Our analysis suggests that a limiting size of 
single-domain proteins should exist, and leads to the prediction that small proteins are insensitive 
to random mutations. 



In order for a protein to perform a biological activity 
it has to fold to a structure with a well-defined topology. 
It is widely believed that this native state has the lowest 
free energy 0| . It is less well appreciated that the folded 
state of a single-domain protein is only marginally sta- 
ble compared to the ensemble of unfolded structures . 
Thus the polypeptide chain can undergo large amplitude 
fluctuations after reaching the native conformation. An- 
other characteristic of single-domain proteins is that they 
are small: the number of amino acid residues in single- 
domain proteins seldom exceeds 200. Larger proteins 
usually self-organize into multi-domain structures 

The marginal stability of single-domain proteins sug- 
gests that polypeptide chains may be sensitive to muta- 
tions that alter the primary sequence of amino acids. On 
the other hand, many different sequences often lead to 
similar native states, i.e., sequence homology gives rise to 
a class of folded states which are topologically similar . 
Thus, from the perspective of homologous proteins we ex- 
pect that point mutations may have little effect on the 
stability of the folded states of proteins. These observa- 
tions raise the question: to what extent are native states 
of proteins stable when subjected to random mutations? 

Experimentally, the stability of folded states against 
random uncorrelated mutations can only be examined 
by synthesizing a library of mutated sequences starting 
from a known structure In this article we investi- 
gate the stability of folded states using a combination of 
theoretical arguments and exact enumeration for various 
lattice models of proteins. 

Since our arguments are general, we can only quantify 
the effect of mutations in statistical terms. In particular, 
we have analyzed how the probability that a mutation 
causes a change in the native state of a protein depends 
on the number of residues. We assume that the con- 
formations in the native and first excited state are not 
structurally related, so that it is reasonable to focus on 
these two states to estimate the effect of random mu- 
tations. For the simpified models discussed below we 
have verified this assumption by explicit computation of 
the overlap between the two states for a number of se- 
quences. For real proteins the gap would correspond to 



the free energy difference between the native basin of 
attraction (NBA) and the closest competing basin of at- 
traction (CBA); the structures in the NBA and CBA are 
expected to be dissimilar as well. 

The main result of our analysis is that the energy gap 
Aav, averaged over a large number of sequences, and the 
typical change oa in the energy gap due to mutations 
scale with the number of residues M as 



A, 



(1) 



with a, > 0, where A is the dispersion of the interaction 
free energies along the chain, and is the typical change 
in interaction free energy for each residue due to muta- 
tions. Since Aav decreases with M while oa increases, it 
follows that there is a crossover length M* , scaling with 
the instability exponent Q = oi + 9 as 



M* 



(A/a,)i/^= 



(2) 



The native state is robust with respect to mutations if 
AI ^ M*, but there is a considerable chance that the 
native state is destabilized by mutations if M > M*. 

An estimate for a defined in Eq. (|^) goes as follows. 
Let rtA denote the number of contacts that are different 
between the ground state and the first excited state. The 
independent mutations in the interaction free energies, of 
typical size cr^ for each residue, combine to give a change 
in the gap energy of magnitude oa ~ o'^^/nK- Assuming 
that the ua different contacts are located on a fractal 
surface, we obtain the inequality a > {d — l)/2d. The 
maximum possible number of contacts, which can only 
occur for d = oo, is n^^"" = AP. Thus (d-1) /2d < a < 1. 

To motivate why Aav decreases algebraically with M 
consider an ensemble of n random energies, drawn from 
a Gaussian distribution f{E) oc exp(— i?^/2). The dis- 
tribution function for the energy gap A between the 
two smallest values among the n samples is given by 
/(A) (X nJZ,dEfiE)[g{E + A)r-\ where giE) = 
dE' f{E') equals the probability that the energy is 
larger than E. It turns out (see also below) that 
/(A) is very well described by an exponential distri- 
bution function, /(A) = (1/Aav) exp(— A/Aav)- Con- 
sequently, we have 1/Aav — — (d/dA) log/(A)|A=o = 



1 



-n jT^dE E f{E)[g{E)Y'-'^ . By numerically evaluat- 
ing this integral for various values of n we find that for 
large n the average gap decreases as Aav ^ (logn)^^ 
with 9 ~ 0.2 in the range 10 < n < 10^. Since different 
conformations may have considerable overlap, the effec- 
tive number n of independent samples is smaller than 
the total number of conformations, which grows expo- 
nentially with M . But it is reasonable to assume that n 
still depends exponentially on the number of residues, so 
that n ^ c*^ gives rise to Aav . This power law 

behavior is independent of the constant c. 

To numerically verify the scaling behavior of Eq. (|^), 
which leads to the length scale M* (cf. Eq. (|2|)), we have 
used simplified lattice models for proteins |4| that allow 
for complete enumeration of all conformations for chains 
of moderate length. In these models, a protein is re- 
presented by a self-avoiding walk on a lattice. Natural 
proteins are made from twenty amino acids, roughly half 
of which are hydrophobic. Contacts between hydropho- 
bic residues which form the core of the protein are fa- 
vored. To mimic the diversity of natural proteins we 
use a model in which each residue is assigned a random 
"charge" , \j , drawn from a Gaussian distribution with 
average (Aj) = Aq and variance ^ {{Xj — Aq)^) = A 
Positive Aj's may be identified as hydrophilic, whereas 
negative values of Xj correspond to hydrophobic residues. 
As a typical value for the ratio of the average hydropho- 
bicity and its dispersion we use Aq/A = —0.1. This choice 
is consistent with the fact that in nature approximately 
55% of the residues are hydrophobic We have used 
this value in most of our calculations, but as will be 
shown below, the results are rather insensitive to Aq in 
the range — 1 < Aq/A < 0. 

We use four different lattices: the square and the tri- 
angular lattice in two dimensions, and the diamond and 
the cubic lattice in three dimensions. Consider a confor- 
mation v in which the M residues in the chain reside on 
lattice sites rj, with j labeled 1 through M. By defini- 
tion, a contact occurs between two residues j and k if 
they are nearest neighbors on the lattice. In terms of a 
(symmetric) contact matrix C a contact is represented by 

1 are always near- 



CJj. = 1. Subsequent residues j and j 
est neighbors on the lattice, and therefore 



0. Wc 



also define Cj; = 0. 

When two residues j and k are in contact, they give 
a contribution to the total energy E"^ that is by defini- 
tion equal to the average, Ejk = {Xj + Afc)/2, of their 
"charges". Thus, the total energy of a conformation is 
given by 



Aj + A;, 



A, 



(3) 



"^k^j ^jk equals the number of contacts in 
For end point residues, no- 
where z is the coordination 
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which residue j is involved, 
takes values from to z — 1, 
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FIG. 1. Distribution of energy gap A between native 
and first excited state for the diamond lattice model with 
Ao — —0.1 and M — 18, obtained using an ensemble of 10^ 
sequences. The solid line denotes an exponential fit to the 
data shown as filled circles. The other symbols correspond to 
a restriction to specific values of ua- The percentages indicate 
the fraction of sequences for which each ua occurs. 



number of the lattice; for internal residues the maximum 
number of contacts equals z — 2. 

It is important, both from a theoretical and a compu- 
tational point of view, that the energy of a conformation 
v depends on the coordinates {rj} only through the con- 
tact matrix {Cj'j,}, and in fact an even more reduced 
description in terms of {rij} is possible. Therefore, as 
long as we are interested in thermodynamic rather than 
kinetic properties, it is sufficient to consider all possible 
contact sets {n-J}- For each value of M we enumerated all 
possible conformations {rj}, determining {nj} for each 
conformation, and only adding it to the list if it is differ- 
ent from all previously generated contact sets. 

The list thus generated was used to find the energies of 
the native and the first excited state for a large number 
of random sequences. Many different spatial conforma- 
tions {rj} can give rise to the same set of pair contacts 
{n-j}, which makes our approach very efficient. To quan- 
tify this, we first note that the total number of differ- 
ent conformations. Cm, grows exponentially with M, as 
Cm ~ z^g, where z^e is an effective coordination number 
. The number of contact sets also grows exponen- 
tially with M, but with an effective coordination number 
z'^g < Zes, so that C'j^j/Cm ~ (^cffZ-^cft)*^- On the cubic 
lattice Zes ^ 4.684 while z'^g ~ 3.4, and for M = 15 
the contact set enumeration is already more efficient by 
a factor of 400. The description of the energies in terms 
of the variables {rij} therefore allows us to calculate the 
low energy states for a much larger number of sequences 
than would otherwise have been possible. 

Consider the ground state {v — 0) and first excited 
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state {i' = 1) for a given sequence. The energy gap, 
A = — E'^, is defined as the energy difference between 
the native state and the first excited state. A histogram 
of A, obtained by determining the native state and first 
excited state for 10^ randomly drawn sequences {Xj} fits 
very well to the exponential distribution (see Fig. 

The mutations are modeled by adding a small pertur- 
bation to the "charges" , Xj Xj + fj.j , where ^ij is a 
Gaussian random variable with zero mean and variance 
a^. Other ways of simulating the effect of mutations in 
lattice models have also recently been considered Q . The 
effect of a mutation depends on the degree of overlap be- 
tween {n'j} and {"-]}■ Mutations affecting contacts that 
occur in both the native state and the first excited state 
have no influence on the energy gap A, since E*^ and E^ 
are changed by the same amount. When a contact is not 
shared between the two states however, mutations will 
affect the energies of both states differently. To quantify 
the effect of the mutations in terms of {n^} and we 
consider the effective number of different contacts, 

M 

nA=J2{n]-n°^f. (4) 

As a result of the mutations the energy gap is changed by 
an amount equal to the sum of independent Gaussian 
variables /ij/2, A A -I- /iA, where /iA is a Gaussian 
variable with zero mean (this is a direct consequence of 
the fact that (/ij) — 0) and variance oa = 0/i(v^^iA)/2. 
The values of nj^ for an ensemble of sequences {A^} are 
exponentially distributed for large enough M. 

Since A and oa are both derived from the same native 
state and first excited state, it is important to analyze 
to what extent they are correlated. To this end we have 
plotted in Fig. |l| the distribution of the energy gap A for 
subsets of sequences for which ua has a particular value. 
It can be seen that the distribution is essentially inde- 
pendent of riA- Therefore we are justified in considering 
A and ua, or equivalently A and cta, to be independent. 

Figure ^ shows how Aav and oa depend on the number 
of residues for the square and the diamond lattice mod- 
els: oa/o/j is an increasing function of M, while Aav/A 
decreases with M. This behavior holds qualitatively for 
all four lattices studied. Power law fits to the data for 
large M, shown as solid lines in Fig. ||, provide estimates 
for the exponents a and 9 defined in Eq. (|^) for various 
values of Ao/A. In the range —0.5 < Aq/A < there is no 
significant dependence on Aq/A. For Aq/A = 0.5, when 
the chain is largely hydrophilic, a and 9 have different 
values, although their sum (and consequently the insta- 
bility exponent Q — + 9) hardly changes. However, in 
the limit Aq/A — + —oo, where only maximally compact 
states contribute, the scaling behavior breaks down com- 
pletely, and the dependence on M is quite erratic for the 
whole range of M values considered. 
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FIG. 2. Dependence of (a) the average energy gap Aav, 
and (b) the typical change in the gap due to mutations oa, 
on the number of residues M for the diamond lattice model. 
The insets show the same data for the square lattice model. 
Different symbols denote various values of Aq/A, from top to 
bottom: —oo, -1.0, -0.5, -0.1, and 0.5. The data were obtained 
by averaging over 40, 000 random sequences. 
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On all four lattices we find a = 0.3 ±0.1, which is con- 
sistent with the theoretical bound {d — l)/2d < a < 1 in 
both two and three dimensions. We find 6 — 0.6 ± 0.2 
in two dimensions (square and triangular lattice) and 
9 — 1.0 ± 0.2 in three dimensions (diamond and cubic 
lattice). Both these values are significantly larger than 
the theoretical prediction 6 ~ 0.2. It is quite possible 
that in the range of M values that allow for complete 
enumeration, 6 is enhanced by finite-chain-length effects, 
and that for larger M the decay of Aav /A is slower and 
6 reaches its predicted value 9 ~ 0.2. But for M* as 
defined in Eq. (^ to exist we only require that 9 > 0. 
The values of the instability exponent Q ^ ct + S ior the 
various lattices are 0.9 ±0.1 (square), 1.0 ±0.2 (triangu- 
lar), 1.2 ± 0.2 (diamond), and 1.3 ± 0.3 (cubic). A more 
accurate determination of the exponents would require 
much larger values of M. 

In Ref. [H it was shown that an energy function like 
the one defined in Eq. (||) — i.e. a sum of one-residue 
properties rather than a sum of pair interactions — pro- 
vides a good parametrization of the interaction matrix 
that was constructed by Miyazawa and Jernigan [Tc| ] by 
statistically analyzing a large number of known protein 
structures obtained from the Protein Data Bank. To ver- 
ify this claim explicitly we have performed calculations 



using the interaction matrix of Ref. |10| and find values 
for the exponents that are in the same approximate range 
as the values obtained using the "charge" model ||ll| . 

In obtaining the above results for random heteropoly- 
mers we have averaged over all sequences without regard 
to the degeneracy of the ground state. Since the native 
states of proteins are thought to be unique , we have re- 
peated our calculations by including only those sequences 
that have a non-degenerate native state. Surprisingly, the 
values of a and 9, and hence of the instability exponent 
Cs, are virtually unchanged. 

It might be argued that for our results to be applicable 
to proteins, which in the course of evolution may have 
been "optimized" (e.g. with respect to folding kinetics 
) , one must only include "optimized" sequences in the 
averaging process. Although this may change the value 
of the instability exponent we expect that a and 9 
will both remain positive. This expectation is reasonable 
given the fact that our results are virtually unchanged 
when we consider only non-degenerate ground states, as 
mentioned above. Thus we expect our conclusion that 
the native state is unstable against weak mutations for 
sufficiently long chains also to hold for (single-domain) 
proteins. 

Since in biological systems mutations occur at a given 
rate and the dispersion in the interaction free energies 
of the residues has a limited range of values , the fact 
that single-domain proteins with more than about 200 
residues rarely occur may be explained by the sensitivity 
of the native state to small random perturbations when 
the number of residues is large. To make this more pre- 



cise, consider pd = da dAf{A\Aa^)g{a\aA), i.e., 
the probability that the folded state is destabilized by the 
mutations, which only depends on the ratio x = a^/ A^^-v 
Here /(A|Aav) is an exponential distribution with av- 
erage Aav and (7((t|(Ta) is a Gaussian distribution with 
zero mean and width oa- For small x we have pd ^ x 
so that using Eq. (^) we obtain p^ = c{M/M*Y'' for 
M ^ M*, with c a constant. Now let £ be a small 
threshold probability separating stable (M < M^) from 
unstable (M > Mg) proteins, where is defined as the 
value of M for which pd = e. Eliminating M* in favor of 
Mr we obtain 



e 



M 

m7 



(5) 



Given a pair {e^M^) we can therefore predict the stabil- 
ity for other chain lengths M < using Eq. (||), pro- 
vided that we know the value of the instability exponent 
Cs. Note that the ratio of the pd values for two different 
values of M is independent of e, and is thus a suitable 
quantity to be considered in experiments designed to ver- 
ify Eq. (H). Since Cs — 1, the destabilization probability 
Pd decreases more or less linearly with M. It follows 
that on average small proteins are relatively insensitive 
to point mutations. 

In order to utilize our theory for analyzing experi- 
ments, it should be kept in mind that the models pro- 
vide a coarse-grained description of a protein, where each 
bead corresponds to b amino acids, with 3 ^ 6 ^ 4 [p^ . 
In terms of the number of amino acids Maa and the free 
energy dispersion per amino acid Aaa, the corresponding 
model parameters are M = M^/b and A = AaaV^- The 
coarse-graining does not affect Eq. (||) . 

It is important to estimate the effect of thermal fluc- 
tuations on the predictions of our theory. According to 
Eq. (|l|), for large enough M we have Aav = caAM~^, 
with CA — 3 and ~ 1 obtained from the power law fits 
of Fig. Ij. Taking into account the coarse-graining factor 
b and estimating Aaa — (15-25) A:T, we find for medium- 
sized proteins with Maa — 100 that Aav — (3-5) kT. We 
conclude that thermal fluctuations do not signiflcantly 
affect our conclusions as summarized by Eq. (H). Note 
that the scaling of Aav with M suggests that smaller 
proteins in general have larger melting and denaturation 
temperatures fl . 

There are some striking similarities between our find- 
ings and certain results probing the chaotic nature of the 
spin-glass phase. Using an Imry-Ma style argument [p^ , 
Bray and Moore showed that in the ground state of 
a disordered Ising model the average excitation energy 
for a cluster of linear size L scales as L^, while the typ- 
ical energy associated with random bond perturbations 
scales as L"^^/^, with ds/2 > y- As a consequence, at suf- 
ficiently large length scales the ground state is unstable 
against arbitrarily weak perturbations to the bonds. In 
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this sense, the ground state of spin-glasses is "chaotic" 
p6|-|T8t , and by analogy the same could be said about the 
native state of a protein. 
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with M. Betancourt, D. Klimov, and A. Latz. This 
research was performed under NSF Grant No. CHE96- 
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